This study examined the usefulness of an evaluation
procedure designed to measure performance in spoken English. Rating
involved assessment of the prose reading and conversation skills of
51 first-year students at Wellington Teachers' College, New Zealand.
Specific topics of investigation included the consistency of "general
impression" ratings between evaluators, the extent to which teachers
can differentiate between factors on the rating scale, the degree of
correlation between assessment of prose reading and conversation, the
performance differences between younger and older students, and
differences between evaluator ratings in a live interview and in a
taped session. Many factors were found to influence the assessment of
oral language: the personality of the evaluator, the number of
evaluators used, and the administrative practicability of the test
instrument itself. Other findings indicated that a high correlation
existed between ratings of taped and live situations, that older
students performed better than did younger students, that a fair
degree of consensus was achieved between evaluators, and that prose
reading and conversation were two different skills. (KS)
Assessment of Spoken English



by Elizabeth Rolfe 



Introduction



Emphasis on the spoken word in our English
programmes has strengthened considerably over the last
decade. The National English Syllabus Committee
reflects the change by recommending that more attention
be given to the teaching and evaluation of speaking
skills. Although much research has been undertaken to
clarify problems of reliability in assessing written
English, little is known about the technical aspects of
assessing spoken English. Some work has been
undertaken by Hitchman and Wilkinson in England,
and by Pountney in New Zealand, but many problems
remain unsolved. Some of these have been investigated
by the author in a recent research project, summarized
briefly below.

Fifty seven first-year students at the Wellington
Teachers' College were assessed on locally prepared tests
of Prose Reading and Conversation. For the Prose
Reading each student was required to read aloud two
passages (chosen from six): one dialogue and one
straight description. For the Conversation Test each
student was required to talk briefly about one of a set of
six photographs (see examples of test materials). The
testing sessions were taped and later marked by four
assessors in addition to an on-the-spot assessment by the
examiner. The administration of the whole test took
about 10 minutes for each student, 6 minutes for
selection of materials and 'thinking time' for the student,
and 4 minutes for the actual test.

For the first part of the test, the Prose Reading, the
student chose two passages, and then had a couple of
minutes to glance over them before being asked to read
aloud. For the second part, Conversation about a visual
stimulus, the student was told that he should attempt to
develop a theme independently of the examiner. The
examiner was there to ask a few standardized questions
at the beginning in order to get conversation started.
Subsequently, the examiner was more of a sympathetic
listener than an active participant in the conversation.

The rating procedure involved marking on separate
factors (such as Interpretation and Delivery for Prose
Reading) and then marking for General Impression (see
rating scales for Prose Reading and Conversation).
Previous work has shown that many assessors prefer
general impression marking because it represents a
unitary response to a student's spoken English
performance. Such assessors would rather judge the
whole performance unfragmented, considering the whole
to be much more than the sum of the separate parts.

The main aim of the present study was to examine the
reliability of evaluation in oral English. Subsidiary
problems investigated were: the consistency of general
impression marking, the extent to which teachers can
differentiate factors on the rating scale, the degree of
correlation between marks awarded for Prose Reading
and Conversation, the difference between students just
out of school and the 'more mature' students, sex
differences in test performance, and finally the difference
in score distributions of marks from the live situation
and from tape recordings.


The Results 

1. There was found to be a fair amount of agreement
between the marks of individual assessors. The
correlations clustered around 0.6, which is similar to
those found in the marking of English essay-type
answers. The agreement was highest in Prose
Reading, particularly in the dialogue passages
which required the students to be more expressive.

The extent of the markers' experience was a
factor affecting the consistency of assessments. The
two more experienced markers indicated a higher
level of agreement with each other than did the two
less experienced markers.


2. The consistency of assessments increased with the
number of independent markers involved in the
evaluation. There was a noticeable increase in the
mean correlations (0.62 to 0.69) when a single
marker's assessments were compared with the
average of pairs of markers. There was only a slight
increase (0.69 to 0.72) when a third marker's
assessments were added to the pool.

3. General Impression marks were shown to be almost
as consistent as the composite marks resulting from
summation of marks on the separate rating factors.
This suggests that the General Impression mark is
a satisfactory assessment on its own.

4. The results revealed that there was a considerable
amount of overlap between the marks given for
Interpretation and Delivery on the Prose Reading
test. Apparently teachers cannot effectively
differentiate between these two factors. However, it
may be justifiable to retain both as separate factors
to be rated, provided markers receive sufficient
training in how to discriminate between them.

The overlap between marks awarded for
Analysis-Content and Language on the
Conversation was very high, indicating that the
speech qualities that the assessors evaluated in both
cases were more or less one and the same thing.
This suggests that the two factors should be
combined for rating purposes and renamed
Content-Language.

The Delivery factor of Conversation proved to be
the most independent, i.e. teachers found it
comparatively easy to mark this as a separate
aspect of performance.

Assessment of Spoken English by E. Rolfe, unpublished
NZCER Research Report, 1975. Available from NZCER
Library.



5. Assessments from the tapes suggested that Prose
Reading and Conversation were two different skills.
Good performance on one does not necessarily
indicate good performance on the other.



6. Students who had left school for more than one year
performed much better on the Spoken English test
than fellow students who were in their first year out
of school. The fact that the older age group
performed better is not surprising since the Spoken
English tests appeared to favour those students
with confidence, maturity and a well-developed
personality. The words of one assessor were that
"the older students had more clarity of thought and
speech" while another stated that "their confidence
and firmer voices" were their main advantages over
the younger students.

7. The current study revealed no significant sex
differences in performance on the Spoken English
test. Female students performed slightly better in
Prose Reading, but not in Conversation.

8. There was no difference between the distribution of
marks (i.e. the average mark or the spread of
marks) from the live situation and those from
tape recordings. However, the comparison was
based on the assessments of only one examiner. The
correlation between marks from the live situation
and those from tape was high. If this finding is
confirmed in other studies, it would have important
implications for testing practice in this field.
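
Findings 1, 2 and 8 rest on correlations between sets of marks and on comparisons of average mark and spread. The report summarized here does not give the statistical procedure or the raw data, so the following is only a minimal sketch of how such figures might be computed, assuming Pearson correlations and simple averaging of pooled marks. The marker names, the marks and all function names are hypothetical and are not taken from the study.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, stdev


def pearson(x, y):
    """Pearson correlation between two equal-length lists of marks."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


def pooled(mark_lists):
    """Average, student by student, the marks given by several markers."""
    return [mean(per_student) for per_student in zip(*mark_lists)]


def single_vs_pool(marks, pool_size):
    """Mean correlation between one marker and the pooled marks of
    `pool_size` other markers, taken over every such combination."""
    values = []
    for name, own in marks.items():
        others = [m for n, m in marks.items() if n != name]
        for group in combinations(others, pool_size):
            values.append(pearson(own, pooled(group)))
    return mean(values)


def describe(scores):
    """Average mark and spread (sample standard deviation)."""
    return mean(scores), stdev(scores)


# Hypothetical General Impression marks (0-10) for eight students,
# given from tape by four hypothetical markers A to D.
marks = {
    "A": [7, 5, 8, 3, 6, 9, 4, 6],
    "B": [6, 5, 9, 4, 5, 8, 5, 7],
    "C": [8, 4, 7, 3, 7, 9, 3, 5],
    "D": [6, 6, 8, 2, 6, 7, 5, 6],
}
# Hypothetical on-the-spot marks for the same students by one examiner,
# who is assumed here to be the same person as marker A.
live = [7, 6, 8, 4, 6, 9, 4, 5]

print("single vs single marker:", round(single_vs_pool(marks, 1), 2))
print("single vs pool of two:  ", round(single_vs_pool(marks, 2), 2))
print("single vs pool of three:", round(single_vs_pool(marks, 3), 2))
print("live mean, spread:", describe(live))
print("tape mean, spread:", describe(marks["A"]))
print("live vs tape correlation:", round(pearson(live, marks["A"]), 2))
```

With real marks in place of the invented ones, the first three printed figures correspond to the kind of comparison reported in finding 2, and the last three to the live versus tape comparison in finding 8.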

Conclusions

Many factors affect the validity of oral assessment. The
quality of the marker is of supreme importance; an
assessor should have the kind of personality that can
calm an anxious student and encourage a shy student to
talk. Experience at Spoken English assessment is an
advantage and so too is the opportunity to meet and
discuss problems with other assessors. Standardization of
test materials and conditions can also help increase the
amount of agreement between markers.

Administrative practicability is an important aspect of
any testing programme. Spoken English tests are often
regarded as impracticable because most are individual
tests, thus very time consuming. However, the research
suggests that it would be possible to have group
discussion tests being video-taped or tape recorded and
evaluated later. This may not actually cut down the time
factor but it may be a satisfactory way of simultaneously
involving more students. Also, if the tests are being
recorded, the teacher-examiner can give full attention to
improving the quality of the test situation without having
to be pre-occupied with marking 'on the spot'. The
video-tape and tape recorder both appear as means of
making Spoken English assessment more practicable.
However, the technical equipment used must be of
superior quality.

The research described above demonstrated that the
pooled assessments of two markers from the tapes of
individual students are a most efficient way to gain reliable
evaluations of Spoken English. Therefore, teachers
should give such assessments on two or more different
sub-tests (such as Prose Reading and Conversation); this
should preferably be done twice during the year. There
would be no need to test all students at the same time;
rather the testing could be scattered throughout the
school year. The amount of test preparation to be done
by students beforehand would be minimal, thus
minimising "cramming" and anxiety. All tests could be
taped to enable at least a second assessment to be made.
A sample of the taped tests could then be assessed by
expert external assessors for purposes of moderation,
that is, determining comparable standards between
schools. These external assessors could meet with
teachers before the assessments are made, in order to
discuss the use of the rating scale(s) and the qualities of
Spoken English to be evaluated.

Although more research is needed on some of these
problems, enough is now known about the assessment of
oral English for such recommendations to be made with
some confidence.

Footnote: 

Spoken English, for the purposes of the study discussed
here, was defined as follows:

i) the ability to read aloud passages of connected
English prose and whilst doing so to reveal one's own
powers of interpretation and appreciation;
ii) the ability to converse at some depth with an adult
on a chosen subject.

The student's power to communicate mood and ideas
was relevant to these two dimensions of Spoken English.
Also relevant was the student's command of language
and his ability to present ideas with a pleasant voice and
clear diction.

Ideally any test of Spoken English should assess the
wide range of speech situations that a person encounters
in everyday living, such as casual greetings, 'small talk',
conversation, group discussion, speech-making and
reading aloud, all with varied purpose and audience.
To make the exercise practicable, the Spoken English
tested in the present study included just two of these.
Rating Scale for Prose Reading

[a] Interpretation

9, 10: Delivery indicates a good understanding of
the passage: skilful phrasing, fluent
rhythm, expressive intonation, flexible use
of pace and pause. Mood appreciated and
communicated. Easy to listen to.
7, 8
4, 5, 6
2, 3
0, 1: Delivery indicates poor understanding of
the passage: phrases too long, too short,
jerky or staccato rhythm; overdone
intonation, flat, sing-song or otherwise
monotonous intonation. Pace too fast, too
slow or arhythmic. No appreciation of
mood.

[b] Delivery: Voice, Diction (Mechanics)

9, 10: Easily heard. Accurate pronunciation.
Variety of intonation. Strong, pleasant
voice. Well-pitched. Clear crisp diction.
Final consonants adequately defined.
Unaffected.
7, 8
4, 5, 6
2, 3
0, 1: Inaudible or too loud. Inaccurate
pronunciation (i.e. sounds omitted,
substituted or added). Monotonous. Weak,
husky, nasal. Pitch too high or too low.
Careless, defective diction.

[c] General Impression

9, 10: Good content, language and delivery.
Overall very effective communication.
7, 8
4, 5, 6
2, 3
0, 1: Poor on all aspects of this spoken English
test. Made no impact on the listener(s).

Rating Scale for Conversation

(Revised as a consequence of the findings of the NZCER
study)

[a] Analysis-Content (ideas)

9, 10: Spontaneous and fluent presentation of
ideas. Content of good quality, revealing
some depth of thinking. Well-ordered
arrangement of ideas. Shows ability to
develop a theme. Coherence of ideas.
Vocabulary and structure suitable and of
adequate range. Ease of presentation.
Convincing.
7, 8
4, 5, 6
2, 3
0, 1: Finds it difficult to say anything, or is
verbose. Ideas shallow and superficial.
Ideas are muddled. Finds it difficult to
develop a theme. Fails to keep to the point.
Inadequate vocabulary. Uses slang
inappropriately. Awkward presentation
with too many pauses, false starts and gap-
fillers. Lacking force.

[b] Delivery: Voice, Diction (Mechanics)

9, 10: Easily heard. Accurate pronunciation.
Variety of intonation. Strong, pleasant
voice. Well-pitched. Clear crisp diction.
Final consonants adequately defined.
Unaffected.
7, 8
4, 5, 6
2, 3
0, 1: Inaudible or too loud. Inaccurate
pronunciation (i.e. sounds omitted,
substituted or added). Monotonous. Weak,
husky, nasal. Pitch too high or too low.
Careless, defective diction.

[c] General Impression

9, 10: Good content, language and delivery.
Overall very effective communication.
7, 8
4, 5, 6
2, 3
0, 1: Poor on all aspects of this spoken English
test. Made no impact on the listener(s).

Note: The rating for General Impression should be done after
the rating on the other factors. The total effect of the prepared
talk or conversation is what is called for here.
Rating Scales are adapted from Hitchman, P.J., E
Spoken English (Methuen, London, 1970)



